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STATISTICAL  METHODS,  SOME  OLD,  SOME  NEW 
A  TUTORIAL  SURVEY 

Donald  P.  Gaver 

This  report  is  a  revised  and  expanded  version  of  material 
on  statistics  presented  at  the  Summer  School  on  Remote  Sensing  in 
Meteorology,  Oceanography,  and  Hydrology,  held  at  the  University 
of  Dundee,  Scotland  during  the  month  of  September  1930;  its  direc- 
tors were  Professor  A.  P.  Cracknell  and  Dr.  G.  Ostrem. 

The  attendees,  both  students  and  faculty,  were  from  many 
countries  and  professional  and  educational  backgrounds.   There 
were,  for  example,  physicists,  electrical  engineers,  atmospheric 
and  geophysical  scientists,  physical  geographers,  photographers-- 
and  two  statisticians:   Dr.  Ed  Wegman  of  ONR,  Washington,  and  Dr. 
Donald  Gaver  of  the  Naval  Postgraduate  School,  Monterey,  California 
The  present  report  was  largely  assembled  by  D.G.,  with  inputs  from 
E.W.   A  shortened  version  was  placed  on  transparencies  and  formed 
the  basis  for  two  one-hour-plus  lectures. 

Those  who  attended  these  lectures  appeared  to  be  quite  posi- 
tive in  their  response.   It  will  be  noted  that  no  attempt  was  made 
to  present  much  formal  mathematical  material,  perhaps  to  the  lis- 
teners' surprise.   In  view  of  the  background  of  most  participants, 
this  seemed  appropriate.   An  attempt  was  made  to  lead  the  listeners 
into  some  of  the  approaches  and  concerns  of  the  data  analyst; 
particularly  one  who  might  be  working  with  "environmental"  data. 
In  the  course  of  talking  to  participants  during  the  school,  and 
particularly  following  these  lectures,  I  discovered  that  there 


was  a  good  deal  of  interest  in  statistical  methodology  for  use 
in  the  atmospheric  and  geophysical  sciences.   I  hope  to  follow 
up  on  some  of  the  contacts  made,  and  perhaps  engage  in  collabora- 
tive activities.   There  seem  to  be  many  who  are  interested,  and 
much  to  be  done  in  applying  statistical  thinking  and  methodology 
to  the  various,  generally  environmental,  geophysical,  or  atmos- 
pheric areas  of  interest  represented  at  this  exciting  and  stimu- 
lating summer  school. 


STATISTICAL  METHODS,  SOME  OLD,  SOME  NEW 
A  TUTORIAL  SURVEY 

Donald  P.  Gaver 


1.   Introduction 

The  purpose  of  the  two  lectures  with  the  above  title  is 
to  introduce  to  some,  and  review  for  others,  selected  topics  in 
statistical  methods.   It  is  hoped  that  these  topics  will  be  use- 
ful to  workers  in  remote  sensing.   Few  details  will  be  given, 
but  references  will  be  provided  so  that  those  interested  can  go 
further. 

Statistics  can  be  tentatively  defined  as  the  science, 
technology,  and  art  of  drawing  quantitative  inferences  from  data. 
The  subject  is  always  in  a  process  of  development,  urged  by  the 
needs  of  those  who  have,  or  plan  to  obtain,  data,  and  further 
stimulated  by  conceptual  developments  and  the  widening  possibili- 
ties of  using  digital  computers  for  storing,  analyzing,  displaying 
and  finally  understanding  the  meaning  of  that  data. 

1.1.   Phases  of  a  Statistical  Inquiry 

It  seems  useful  to  distinguish  several  phases  of  a 
statistical  inquiry 

o  Objectives  of  the  inquiry, 

o  Data  acquisition/ 

o  Data  exploration  , 

o  Model  choice  or  construction , 

o  Confirmatory  or  validational  analysis  > 

o  Communication  of  results • 


Our  term  data  acquisition  subsumes  the  choice  of  what  data  to 
take  in  a  particular  problem  setting,  experimental  design  for 
economy  and  avoidance  of  bias,  etc.   This  topic  is  crucially 
important,  but  for  this  audience  requires  a  substantive  knowledge 
of  remote  sensing  issues  that  we  do  not  yet  possess.   It  will  not 
be  discussed  here.   By  data  exploration  is  meant  the  activity  of 
examining  data,  both  graphically  and  through  numerical  summaries, 
for  the  purpose  of  revealing  properties  of  the  data  itself,  and, 
with  luck,  of  the  process  giving  rise  to  that  data.   John  Tukey, 
e.g.  (1977)  has  shown  us  that  there  is  much  to  value  and  learn 
about  exploration  of  data  before,  or  even  without,  the  use  of  the 
probability  theory  structure  that  has  traditionally  seemed  so 
necessary  for  formal  statistics.   We  will  discuss  several  of  these 
simple  exploratory  approaches  in  the  first  lecture.   Model  choice 
or  construction  is  the  phase  of  inquiry  that  brings  to  bear  sub- 
ject matter  knowledge  to  "explain"  data  behavior.   Here  probability 
theory  may  well  enter  to  represent  measurement  errors,  or  fluctua- 
tions in  natural  phenomena,  such  as  the  heights  of  wind-driven 
water  waves.   The  confirmatory  analysis  phase  is  concerned  with 
developing  quantitative  expressions  for  uncertainties  inherent 
in  simple  summary  statements,  and  in  more  complex  characteriza- 
tions and  predictions  made  from  formal  models.   This  is  the  formal 
inference  phase  to  which  mathematical  (especially  probability- 
theoretic)  methods  continue  to  contribute. 


2.   Simple  Graphics  and  Summaries  of  Data 

2.1.   Stem  and  Leaf  (Successor  to  Histogram) ;  Transformation 

Suppose  we  have  a  batch  of  measurements  on  some  physical 
quantity.   We  wish  to  display  these  numbers  in  f requency-of- 
occurrence  style,  losing  as  little  detail  as  possible. 

Here  is  such  a  batch: 

0.92,  1.14,  2.00,   2.66,  6.52,  4.95,  2.76,   2.13,  1.14,  9.72 
11.24,  0.25,.  3.31,  12.66,  0.46,  2.77,  0.35,  66.67,  0.59,  1.84 

To  get  a  stem-and-leaf  display  draw  a  vertical  line  (stem)  and 
decide  on  reasonable  class  intervals:   first  try  intervals 
0  ■*  10-,   10  -*  20-,   20  •*■  30-,   etc.   Now  put  the  integer  part 
of  the  number  by  convention  (rounded  to  the  nearest  integer)  to 
the  right  of  the  stem,  e.g.  0.92  becomes  1,   1.14  becomes  1, 
...6.52  becomes  7,  etc.   The  stem  and  leaf  appears  as  follows: 
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At  the  end  there  is  a  leaf  of  entries  attached  to  the  stem  at  0 , 
1,  and  7.   Imagine  that  each  leaf  entry  (single  integer)  occupies 
one  unit  of  area,  and  hence  count  frequencies  are  proportional 
to  areas,  exactly  as  in  the  conventional  histogram.   Note  that 
greater  detail  concerning  number  identify  is  preserved  by  the 
stem-leaf  than  is  managed  by  the  conventional  histogram. 


Stem-leaf  displays  are  usually  constructed  by  making  a 
convenient  but  informal  choice  of  leaf  interval  (here  it  was  10 , 
but  for  more  detail  it  could  be  5,  and  for  more  smoothness,  hence 
less  detail,  20)  .   A  more  formal  approach,  see  Scott  (1979)  ,  is 
that  of  picking  the  bin  size,   h  ,   so  as  to  minimize  an  integrated 
mean-squared  error  of  estimate  of  the  true  density  by  the  histo- 
gram.  In  summary,  if  the  data  is  approximately  Gaussian  then 
the  prescription  turns  out  to  be 

3.49  s 
h   =  

n      r~ 
vn 

where   h   is  bin  size  for  a  batch  of   n,   and  where   s   is  the 
n 

sample  standard  deviation.   Observe  that  a  few  extreme  outliers 

in  an  otherwise  Gaussian-appearing  batch  will  unjustifiably  expand 

s,   and  over-coarsen  the  histogram.   A  robust  estimate  of   a, 

such  as 

£  _  (Upper  Quartile) - (Lower  Quartile) 
°  1.35 

might  then  be  recommended  instead  of   s. 

A  second,  more  basic,  feature  of  the  raw  data  and  the 
display  is  the  crowding  in  the  first  class  interval:   numbers 
like   0**   are  all  jammed  together,  and  there  is  evidence  of 
systematic  (right)  skewness.   In  such  cases  transformation  of 
the  basic  numbers  before  plotting  is  often  useful,  and  in  this 
particular  case  a  first  attempt  might  be  with   logarithms  (base 
e  =  2.7182...,   but  the  particular  base  doesn't  matter).   Here 
are  the  logged  numbers: 


-0.083,   0.13,  0.69,  0.98,  1.87,  1.60,   1.02,  0.76,   0.13,  2.27 

2.42,   -1.39,  1.20,  2.54,  -0.78,  1.02,  -1.05,  4.20,  -0.53,  0.61 

Here  we  might  try  a  scale  of  0.5,  so,  rounding  to  the  nearest 
tenth  of  an  integer: 

Ranges  Stem-Leaf 
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Notice  that  the  previous  crowding  and  skewness  has  nearly 
disappeared,  and  we  see  revealed  the  vestigial  appearance  of 
two  separate  "humps,"  as  well  as  the  originally  apparent  outlier 
(£n  66.67  ~  4.2).   We  are  led  to  investigate  the  possibility  that 
the  data  derive  from  two  separate  sources,  with  one  exotic  (mis?) 
measurement  thrown  in  for  good  luck.   In  this  form  the  data  urges 
more  upon  us  than  it  did  originally. 
2.2.   Number  Summaries 

Traditionally  it  has  been  customary  to  summarize  certain 
features  of  batches  of  measurements  by  moments :   the  (arithmetic) 
mean  gives  the  "location"  of  the  data  set,  the  standard  deviation 
summarizes  "spread"  (or  "scale,"  or  "width"),  the  third  central 
moment  measures  "skewness,"  and  so  on.   However,  certain  apparently 
more  primitive  measures  are  useful  and  have  virtues. 


Work  with  the  observations  after  ordering  them  into 

x(l)  <X(2)  <x(3)  <  '"   <x(n)  * 

a)   Median,   M. 

Compute  the  median  index  m  =  (l+n)/2.   If   n   is  odd 
x,  *   is  the  middle  (single)  number,  and  if   n   even  then  average 
the  two  middle  numbers.   Thus  the  median  is 


M 


t   x,  ,  if   n   is  odd  , 

(   (m) 


2(x([m])  +x([m]+l)>   if  n   is  even  • 

where   [m]   is  the  integer  part  of   m.   Notice  that  quite  radical 
changes  in  extremes,  e.g.   x/i\   or  x(    )'      effects  M  not  at 
all;  in  fact  changes  from  x,  v   to   1000  x,  ,,   creating  a  very 
isolated  single  observation,  leaves  M   alone,  so  M   is  resistant 
(un-influenced)  by  such  changes,  or  occurrence  of  outliers.   On 
the  other  hand,  the  familiar  mean  responds  dramatically  and  unde- 
sirably and  is  far  from  resistant.   Thus  the  median  is  a  useful 
candidate  for  "placing"  or  "locating"  a  rather  concentrated  batch 
of  points  when  several  exotic,  separated,  extremes  are  present; 
the  mean  is  apt  to  strike  a  meaningless  compromise.   Of  course, 
the  median  does  not  well-summarize  a  batch  having  two  or  more 
distinct  but  nearly  equal-sized  humps.   But  certainly  it  does  no 
worse  than  the  mean.   Later  we  discuss  some  better  measures,  and 
also  some  for  spread  (alternatives  to  standard  deviation) . 


(b)   Quartiles:   Lower  =  Q,   Upper  =  Q 

Roughly  one-quarter  of  the  observations  fall  below  Q 
(above  Q) .   Define 

q  =  |(1+  [m])  . 


Then 


!x,  .   if   [m]   is  odd,  as  c[  is  integer  , 
!(X([q]>  +X([q]+1))   lf   [ml   iS  eV6n  • 

Q   is  defined  analogously  in  terms  of  n-q  +  1.   In  other  words 
Q(Q)   is  the  median  of  the  lower  (upper)  half  of  the  ordered 
batch. 

(c)   Eighths:   Lower  =  E,   Upper  =  E 

About  one-eighth  of  the  observations  fall  below  E  (above 
E) .   Let 

e  -  |(1+  [q])  . 

Then 


x,    .   if   [q]   is  odd  , 
(e)        iJ 


E  = 


l[x(rel)  +x(rel+l))   if   [q]   is  even  ' 


2l~([e])   A([e]+1) 
and  E   is  defined  analogously  in  terms  of   n-e  +  1. 


(d)   Extremes 

There  are  simply 


Ext  =  x 


(1)  ' 


Ext  =  x 


(n)  * 


A  seven-number  summary  of  the  data  is  as  follows 


Ext 


M 


ill]    EMQ 


E  +  E 


=    ME 


Ext  +  Ext 


=    MExt 


Q 


Ext 


MQ  -  M 
SQ  Q-Q 


ME  -  M 
E  E  -  E 


MExt  -  M 


Ext 


Ext-Ext 


The  quartile  (eighth,  extreme)  means   MQ(ME,MExt)  can  be  quickly 
compared  to  M   to  detect  systematic  asymmetry  or  skewness. 
Dimensionless  measures  of  skewness  are  also  given  by  the  quanti- 
ties   sQ,   sE,   sExt. 


Example:   x/i\  =  1/ 

x(6)  =  111. 


X(2)    3' 


x(3)=  5'   X 


(4) 


7   X(5)  =  9' 


Then   m  =  2"d+6)  =3.5,   so 


M  =  | (5+7)  =  6  . 


Note  that  the  mean   x  =  22.67.   This  neither-f ish-nor-fowl 
number  has  responded  heavily  to  the  isolated  value  111.   Next 


8 


so 


Now 


so 


q  =  ±(1  +  [3.5])  =  2  , 


Q  =  3,    and    Q  =  9. 


e  =  |(1+  [2])  =  1.5  , 


E  =  |d+3)  =  2,    and   E  =  |(lll+9)  =  60 


The  number  summary  then  is 


31 


56 


60 


111 


Q 


=  0 


SE  = 


Ext 


30  - 

6 

60  - 

2 

56  ■ 

-  6 

111  - 1 


=  0.43 


=  0.45 


The  numbers  suggest  positive  skewness,  but  closer  examination 
reveals  that  this  is  caused  by  the  influence  of  one  point  alone 
Note  that  i_f  the  data  is  a  sample  from  the  Normal  distri- 
bution with  standard  deviation  o,      then  convenient  approximate 
estimates  of  a      are 


<°-2>  1735  - 


=  o, 


and     (E  -  E)  2730  =  °E 


E  -  E 

and  thus    ~  1.70 

Q-Q 


rw 


if  data  are  approximately  Normal;  this  is  a  handy  check.   An 
up-to-date  test  for  precise  Gaussian  behavior  is  that  of  Wilk- 
Shapiro  (1965) .   It  involves  a  suitable  linear  combination  of 
ordered  observations  as  a  test  statistic.   A  cautionary  note: 
critical  values  of  the  test  statistic  (rejection  region)  are 
obtained  on  the  basis  of  independence  of  batch  data  values, 
often  not  an  acceptable  assumption  in  practice. 

2.3.   Box  Plots 

The  box  plot  is  a  picture  of  the  five-number  (omit  Eighths) 
summary.   Simply  draw  a  rectangular  box  with  ends  at   Q   and  Q, 
and  in  which  M   is  marked.   In  addition,  connect  up  the  extremes. 
As  was  done  with  stem-leaf  we  arrange  the  "box"  vertically. 
Embellishments  are  sometimes  useful:   (i)   it  has  been  recommended 
that  box  width  be  proportional  to   /n   to  indicate  effect  of  sample 
size  when  comparing  different  batches,  (ii)  warning  points  at 
Q  + 1.45  (Q-Q)   and  Q-  1.45  (Q-Q),   something  like  "fences"  of 
Tukey:  if  the  data  are  Gaussian  there  is  only  about  one  percent 
of  the  data  "outside"  these  values.   Box  plots  are  especially  good 
for  giving  a  quick  appreciation  of  comparative  distributional 
behavior  for  different  batches. 


10 


Simple  Box  Plot 
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2.4.       Rooted  Histogram    (  "Roo  to  grains") 

Hark  back  to  the  histogram  over  bins  of  width   h,   with 
n.   observations  in  the  j —  bin.   The  standard  error  of   (n  ./n)  , 
the  frequency  estimate  of 


p .  =  Probability  of  an  observation  in  the  j —  bin 

x   Hi+x.) 
3+1  :  f  (x)dx«f  (x.+h/2)  h 

x . 


is  at  least  under  random  sampling, 


{Pj(l-Pj)/n}1/2  ~  {(iij/n)  (l-nj/njn"1}172 


which  suggests  that  the  magnitude  of  the  sampling  fluctuations 
in  the  high-count  (large  p.)  bins  may  be  much  larger  than  those 
in  the  low-count  bins;  on  the  other  hand,  the  relative  fluctua- 
tion magnitudes  are  greater  in  the  low-count  bins.   The  square- 
root  transformation,  /  n  ./n,   tends  to  stabilize  (equalize)  this 

variation:  /  n ./n      tends  to  estimate   /pT  with  standard  error 
v      3  y     3 

of  about  0.5.   This  suggests  several  possibilities  for  useful 
graphical  analysis: 

(1)   Smoothing  a  raw  histogram  by  smoothing   /n ./n  -  values  and 
re-squaring  the  results,  as  possibly  in 


(/5H/¥-/5-^ 


so  the  estimate  of  p .   becomes 


D,S 


m 
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Other  approaches,  such  as  running  medians,  (see  Tukey 
(1977),  p.  543  ff.)  are  likely  to  be  effective.   The 
attempt  is  to  reduce  sampling  fluctuations  without  resort- 
ing immediately  to  the  "ultimate  smooth:"   a  simple  model. 
(2)   Assessing  model  fit  by  a  hanging  rootogram.   Having  a  model 
distribution  in  mind,  perhaps  suggested  by  theory,  we  wish 
to  graphically  present  the  evidence  for  and  against  it 
using  a  histogram  of  data.   To  this  end,  plot 

/nT  , 

A  .  =  /pT  -   /  — ■*-   vs    x .  +  ~-  ; 
3     FJ    /  n  D    2 

this  is  equivalent  to  looking  at  a  plot  of  the  square  root 
of  the  density,  with  roots  of  the  histogram  values  hanging 
from  it.   If  the  model  is  in  basic  agreement  with  the  data 
then  about  95%  of  the  values  of  the  differences   A .   should 
lie  within  unit  distance  of  zero.   These  differences  "should" 
also  be  nearly  pattern-free  as  order  goes.   Departures 
vividly  show  exactly  where  the  model  and  the  data  disagree. 
Thus  the  procedure  has  a  focus,  and  a  specific  diagnostic 
slant.   It  can  clearly  be  used  for  histograms  in  more  than 
one  dimension. 

The  above  idea,  also  due  to  Tukey  can  be  extended  to  supply 
a  formal  test  of  goodness  of  fit  analogous  to  the  classical  Chi- 
squared  test.   At  the  moment  the  concern  is  with  graphically 
exposing  data  features  and  with  indications  of  model-data 
discrepancies. 
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3.   More  Summary  Measures 

In  Section  2  we  worked  with  the  median,   M,   as  a  summary 
measure  of  location,  and  with  the  midspread  Q-Q   as  a  summary 

measure  of  distributional  spread.   Of  course  the  classical  measures 

/l n ~^~2 

would  be  the  mean,   x,   and  standard  deviation,   s(=  A—*-  \    (x.-x)  ) 

/  n-l  . u -       1 
i=l 

The  latter  are  suspect  because  of  their  sensitivity  to  occurrence 
of  a  few  wild  values — perhaps  recording  errors.   Here  are  some 
currently  well-regarded  alternatives. 

3.1.   Winsorized  mean;  confidence  limits 

Suppose  a  set  of  nearly  symmetrically  distributed 
observations  are  in  hand.   If  a  small  number  of  outlying  observa- 
tions are  feared  one  can  diminish  their  impact  by  symmetrically 
Winsorizing  to  level  g  (g <  0.20n,   n  being  number  in  the  batch) : 

(a)   Order  the  batch 


x(l)  <  x(2)  <  '  *  *  <  x(g)  <  X(g+1)  <  *  '  *  *  X(n-g)  <  x(n-g+l)  <•••■<  x(n)  * 


(b)   Winsorize,  level   g: 

y(l)  =  Y(2)    *••   y(g)  =  x(g+l)  <Y(g+l)  =  x(g+l)  <y(g+2)  *  •"" 

Y(n-g)  =  X(n-g)  =  Y(n-g+l)  =  y(n-g+2)  =  • • •  =  Y (n) ' 

In  other  words,  define  the   g   smallest  in  the  Winsorized  batch 

(y's)   to  equal  the   g —  smallest  raw  data  point,   x,  , .  Do 

the  same  symmetrically  at  the  upper  end.   The  result  will  be  tied 
values  (g   at  bottom,   g  at  top) . 
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(c)      Average,    to   get   a      g-Winsorized  mean: 


n 

_    i_ 

kWg 


1       V 

n    .  L,    n 


i=l 


Notice  that  if   n   is  odd  and  you  Winsorize  to  the  extent 

n  —  1 

g  =  — =—  the  result  is  precisely  the  median.   The  Winsorized 
mean  can  be  viewed  as  a  broadened  median,  with  a  median's  vir- 
tues (oblivious  to  outliers) ,  but  that  uses  more  of  the  sample 
information.   Winsorizing  is  named  for  C.  P.  Winsor,  an  insight- 
ful applied  statistician. 

(d)   Confidence  limits  using  the  Winsorized  mean. 

Confidence  limits  attempt  to  express  the  uncer- 
tainty inherent  in  a  particular  estimate  of  a  fixed  quantity 
(such  as  mean  radiation,  mean  visibility,  etc. )  .   I_f  one  is 
estimating  the  mean,   y,   of  a  symmetric  distribution  (possibly 
after  transformation)  ,  and  if_  the  observations  are  independent, 
then  it  is  appropriate  to  use  the  classical  Student's   t   (or  a 
Gaussian  approximation) :   with  confidence  (not  probability)  of 
(1-a)  •  100%, 

x+t  /0(n-l)—  5  y  5  x  +  t,   /0(n-l)  — 
*/2     /H  l-a/2     ^ 

where   t0 (n)   is  the   3  '  100%   point  of  a  Student's   t  with   n 

p 

"degrees  of  freedom;"   x  and   s   are  the  sample  mean  and  standard 
deviation,  respectively.   Such  intervals  are  reasonably  satisfac- 
tory even  if  the  data  are  not  quite  Gaussian,  but  if  there  are 
some  extreme  outliers,   s  blows  up  and  the  intervals  are  much 
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too  wide.   One  should  also  compute  the  Winsorized  limits:   after 
computing   Xt^*   find 


2      1    r   ,      -   x2 

sWg  "  H^T  i£1  (Y(i)  XWg; 


and  then  find  the  confidence  limits 

=Wg  + 'a^'""2'-1'^  =  "  ^  5Hft+(=SFr)tl-0/2«»-2'-1»^  • 

These  may  be  decisively  narrower;  if  so,  look  for  outliers.   Pick 
a  succession  of   g-values,  but  not  to  exceed  0.20n. 

For  more  details,  see  Dixon  and  Tukey  (19  6  8)  and  Dixon  and 
Yuen  (19  73)  • 

3.2.   Biweights 

The  Winsorizing  procedure  has  to  a  degree  been  supplanted 
by  a  procedure  called  biweights.   This  involves  an  iterative  least- 
squares  calculation  with  weights  on  individual  observations  that 
decrease  as  the  degree  apparent  non-representativeness  (exoticizm) 
of  a  data  point  increases.   Let   x,,x,,...,x   be  data  points; 
then  here  is  a  recipe  for  fitting  a  resistant  center  via  biweights: 
referring  to  the   k  +  1 —  iteration  in  terms  of  the   k — , 


n 

V   w.  (k)  •  x. 

i=l   x      x 
x(k+l)  =  ^^ 

I      w  (k) 
i=l   1 


where  the  weight  of  observation   i   is 
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wi(k) 


n2 


1  - 


x.  -  x(k) 
1 

cs. 


if 


x.  -  x(k) 

cs. 


<  1 


L  o 


otherwise. 


and    s,    is  a  spread-measure,  perhaps 


s,  =  median{  |  x.  -  x  (k)  |  } 


the  so-called  MAD  or  median  absolute  deviation,  or  alternatively 


sk  =  Qk  "  2k 


the  midspread  of  the  residuals  at  the  k —  stage.   These  estimates 
are  essentially  equivalent.   Here   c   is  a  tuning  parameter;  a 
value  of   c  =  6  or  9  has  been  recommended  for  general  use,  while 
c  =  oo  yields  up  the  OLS  estimates. 
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4.   Relations  Between  Variables,  Models,  and  Model  Assessment 

The  discovery  and  establishment  of  quantitative  relation- 
ships between  measurable,  or  classifiable,  variables  is  a  primary 
scientific  concern.   Once  a  relationship  is  uncovered  it  is  soon 
likely  to  be  thought  to  be  potentially  useful  for  some  human 
purpose,  and  attempts  will  be  made  to  apply  it.   Many  examples 
exist  in  the  remote  sensing  area.   Of  course  the  quality,  or 
strength,  of  the  relationship  must  be  such  as  to  make  its  appli- 
cation feasible.   Statistical  methods  and  thinking  are  frequently 
used,  sometimes  very  informally,  to  help  in  finding  a  relationship 
and  to  expressing  it  in  succinct,  often,  formally  mathematical 
terms  (as  a  mathematical  model),  to  characterizing  its  deficiencies 
or  biases,  and  to  expressing  the  uncertainties  inherent  in  its 
use.   Here  is  a  brief  review  of  some  of  these  methods. 

4.1.   Graphical  Plotting  for  Relationship  Exploration 

Many  relationships  between  variables  are  initially 
guessed  from  scientific  theory;  or  at  least  from  subject-matter 
knowledge,  intuition,  or  low  cunning.   When  relevant  data  becomes 
available  the  obvious  first  step  is  to  plot  it  graphically,  if 
possible.   This  is  easy  if  the  relationship  of  interest  is  between 
two  variables:   x,   an  explanatory  variable,  or  condition  (some- 
times called  factor,  or  lately  carrier)  and  y  =  f (x) ,   a  response 
Suppose  one  has  observed  pairs  of  these  variable  values: 
y . ,x. ;i=l ,2 , . . . ,n.   Then  plots  of   f(x.)  vs  x.   and  y.  vs  x. 
on  the  same  graph  (e.g.  rectangular  coordinates)  are  sometimes 
presented  for  comparison.   Since  the  range  of   f (x) — and  y — may 
be  considerable,  such  plots  often  obscure  real  systematic  differ- 
ences, it  is  often  wise  to  plot  and  study  residuals 
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r.  =  y .  -  f  (x.  ) 
1   J 1  1 


either  vs   x.,   or   vs   y.  =  f(x.);   plotting  residuals   vs 
estimated  values   y.   is  one  of  the  obvious  ways  of  evaluating  a 
relationship  depending  on  several  explanatory  variables.   Examina- 
tion of  a  plot  often  reveals  some  systematic  departure  from 
linearity,  such  as  a  definite  curvature  may  become  evident; 
frequently  this  may  be  essentially  removed  by  transformation 
(alternatively  called  re -express ion) .   This  is  a  good  idea,  for 
comparison  of  plots  to  straight  lines  (looking  at  residuals  from 
straight  lines)  seems  well  adapted  to  human  perceptions.   The 
powers   y  ■>  y  (p>l,  or  p<l)   or   y  ■+■  In  y   (zero  power)  are  often 
recommended  for  transformation  to  the  nearly  linear.   Of  course 
theory  (e.g.  the  laws  of  physics ,  such  involving  squares,  etc.) 
may  point  the  way  to  an  approximate  linear  plot.   Then  systematic 
deviations  may  be  identified  and  interpreted  on  their  own  merits. 


19 


4.2.   Relationship  Fitting 

Although  the  mathematical  form  of  the  relationship 
between  response   y   and  explanatory  variable   x  may  suggest  itself 
from  theory,  certain  constants  (parameters)  nearly  always  require 
numerical  determination.   This  means  that  a  mathematical  model 
exists,  and  must  be  fit  to  the  data.   Afterwards,  one  can  look 
at  residuals  for  indications  of  mis-fit,  or  apply  the  model  to 
make  predictions  and  check  out  its  errors.   One  can  sometimes 
segment  the  data,  utilize  a  part  to  fit  the  model,  and  then  use 
the  remainder  to  examine  the  performance  of  the  fitted  model. 
This  latter  procedure  is  called  cross-validation.   Again  residual 
plots  are  desirable  diagnostics. 

The  actual  fitting  (constant, or  parameter,  estimation) 
problem  can  be  carried  out  in  various  ways.   We  illustrate  in 
terms  of  a  postulated  simple  linear  relation  between  response,   y, 
and  explanatory  variable,   x,   i.e. 

y  =  a  +  bx  , 

with   a  and  b   to  be  determined.   Here  are  some  options. 

o      Simple  least  squares 

o      Median,  two-points 

o      Biweights 

Given  a  set  of   (y^x^   data,  least  squares  proceeds  by  minimizing 

the  sum  of  squares 

n  9 

I    (y.-a-bx  )z 

i=l   1     1 
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by  choice  of   a   and  b.   Linear  equations  result  for  estimated 
a,   namely   a  _   and  estimated  b,   namely   £   .   Many  computer 
programs  exist  for  doing  this  problem,  and  also  for  doing  the 
multiple  regression  problem:   fit 


y . ,  j=l,2, . .  .  ,n  by_  ^0  +  ^lxli  +  ^2x2i  +  *  *  *  + 


p  p: 


For  better  or  worse,  any  set  of  data  can  be  conveniently  fitted 

by  such  a  linear  function,  i.e.  the  parameters   3  ,  8..  ,  .  .  .  ,  $ 

can  essentially  always  be  determined  numerically,  whether  or  not 

the  postulated  relationship  makes  sense.   Furthermore,  the  esti- 

mates,   3.,   turn  out  to  be  linear  functions  of  the   y.   values. 
1  J  j 

This  suggests  that   $.   values  also  respond  linearly  to  changes-- 
perhaps  even  unfortunate  or  unsuspected  discrepant  values  or 
errors — in  the  y.   values,  and  so  a  few  funny  values  can  give 
a  warped  picture: 


x 


x 


x 


x 


Least-Squares  Line 
x 


The  fitted  line  may  not  follow  the  main  data  cluster,  but  instead 
fasten  itself  to  a  single  exotic  point.   For  some  discussion  of 
this  and  remedies  see  Mosteller  and  Tukey  (19  77)  and  also  Belsley, 
Kuh,  and  Welsch  (1980) . 

The  second  method  mentioned  ("median,  two-points") 
avoids  some  of  the  above  difficulties.   Prescription:   order  the 
explanatory  variables  from  smallest  to  largest:   x ,  .   <x,2>  ... 
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'{n/3)<xln/M)<"'      X(2n/3)<-"   x(n)  '   Call  the  set  of 
values   x(1),x(2),...  x(n/3)   the  low  group,  and  x(2n/3).'x(2n/3+l) 
...  x,  v,   the  high  group.   Let   x^   be  the  median  of  the  low 
group,  and  x   be  the  median  of  the  high  group.   To  each  member 
of  the  low  group  there  is  a  y;   let  y   be  the  median  of  the 
corresponding  low  y's,   and  y.   be  the  median  of  the  high   y's. 
Now  fit: 


b 


yh"yil 


MED    ^  -  xl 

aMED    Yh  ~  bMED  Xh 

This  procedure  is  a  modification  of  an  ancient  procedure  that 
utilizes  means  instead  of  medians.   Since  medians  are  more  resistant 
to  outliers  than  means,  this  procedure  is  a  move  in  the  right  direction 

The  third  option  ("biweights")  is  an  iteratively  weighted 
least  squares  approach.   It  has  been  programmed  for  many  computers, 
including  evan  a  hand-held  TI-59.   This  procedure  develops  a  weight 
for  each  observation;  the  weight  diminishes  as  the  observation 
becomes  apparently  discrepant,  as  in  the  earlier  discussion. 
Details  are  omitted  here. 

Example.   Sea-surface  wind  speed  and  white-cap  coverage. 

It  has  been  proposed  by  Monahan  (19  71)  that  magnitudes 
of  sea-surface  winds  and  white  caps  are  strongly  related.   Pre- 
sumably white -cap  coverage  can  be  assessed  remotely,  and  wind 
speeds  deduced  there  from.   A  set  of  data  offered  by  Toba  and 
Chaen  (19  71)  has  been  analysed  using  the  above  methods.   An 
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appealing  functional  form  is  the  power  law 

3 
y  =  ax   ; 

it  has  been  argued  on  theoretical  grounds;   3   is  supposed  to  be  in 

the  neighborhood  of  3.   A  computer  scatter  plot  shows  that  this  is 

1/3 
plausible.   An  initial  cube-root  transformation   (y  '   vs  x, 

3 
equivalent  to  the  model   y  =  ax  )   seems  to  straighten  the  plot 

reasonably  well,  but  note  that  a  few  zero  values  occur;  other 

outliers  are  less  obvious. 

It  is  natural  to  try  to  fit  the  log-transformed  version 

of  the  power  law,  for  this  is  now  linear: 

In  y  =  £n  a  +  3  &n  x 

Here  are  some  results  (OLS  means  Ordinary  Least  Square,   Bi(i) 
means  the  biweight  result  after   i   iterations,   s.   means  robust 
scale  after   i   iterations) ;  note  that  it  has  been  necessary  to 
fit  in (y + start) ,   start  =  0.001   here,  in  order  to  avoid  the 
embarrassment  of  logging  exact  zeros.   The  "start"  value  can  be 
chosen  wisely;  no  attempt  to  do  so  is  exhibited  here,  however. 

PARAMETER  FITS  FOR  LOG-LINEAR  MODEL  OF  WHITE  CAPS  vs  WIND 


Method 


OLS  =  B.  (1) 
l 

OLS  (zeros  out) 


Bi(2) 
Bi(4) 
Bi(6) 


Estimate 

in   a 

3 
3.51 

s . 

i 

-9.65 

1.44 

10.31 

4.22 



10.73 

4.24 

1.26 

11.32 

4.68 

1.05 

11.39 

4.25 

1.03 
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Apparently  the  biweight  calculation  yields  quite  different  results 
from  OLS,  even  after  visual  culling  of  apparent  zeros  has  been 
conducted.   It  would  be  of  interest  to  utilize  the  techniques  of 
Cook  (1977)  and  Belsley,  Kuh,  and  Welsch  (1980)  to  further  identify 
influential  observations.   Of  course  the  present  setup  is  simple 
enough  so  that  visual  examination  will  reveal  most  of  what  is 
present.   This  would  not  necessarily  be  so  if  the  explanatory 
variable  were  a  vector. 

Some  computer  plots  are  included  to  show  that  the  biweight 
procedure  provides  fits  that  tend  to  dramatically  reveal  the 
presence  of  outliers.   In  many  cases  the  presence  of  such  outliers 
represents  opportunity  for  new  insight  and  discovery, as  is  pointed 
out  by  Tukey  (19  77)  .   Outliers  should  never  be  immediately  "thrown 
away,"  but  rather  are  candidates  for  special  attention.   See 
Kruskal  (1960)  for  some  wise  discussion  of  this  issue. 
4.3.   Fitting  Probabilities  and  the  Like 

It  is  often  appealing  to  explore  the  relationship  between 
the  probability  of  some  environmental  (or  other)  event  and  some 
reasonable  explanatory  variables.   This  amounts  to  estimating 
conditional  probabilities.   Often  the  events  in  question  are  of 
the  form  "rain,"  "fog,"  "visibility  in  the  range  0-5  kilometers," 
etc.   Candidate  explanatory  variables  may  be  the  product  of  a 
remote  sensing  system,  or  a  numerical  weather  forecasting  scheme 
("model  output  statistics") ,  or  a  combination  thereof.   Some 
attempts  to  use  persistence  ("it  rained  in  Monterey  on  January 
14,  so  the  probability  of  rain  on  January  15  is  *+0.2,   whereas 
without  rain  it  is  *")  are  also  appealing.   In  the  latter  form 
the  simple  idea  of  Markov  chains  has  been  employed. 
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Ordinary  linear  regression  has  been  used  to  predict 
event  probabilities,  e.g.  the  REEP  scheme  of  R.  G.  Miller  (1964) 
This  has  the  aesthetic  difficulty  that  probabilities  are  between 
zero  and  1,  which  ordinary  regression  doesn't  recognize.   Two 
ways  of  dealing  with  probability  regression  problems  are  as 
follows: 

o     logistic  regression 

o    conditional  probabilities. 
In  logistic  regression  one  utilizes  the  simple  model 


a+bx 
P{event  E | explanatory  variable  X=x}  =  p(x)  =  — 


,    a+bx 
1  +  e 


where   a   and  b   are  constants  to  be  determined.   Although  this 

model  has  been  written  in  terms  of  one  explanatory  variable, 

multiple  regression  options  are  available.   One  can  contemplate 

fitting  the  logistic  model  in  several  ways: 

(i)   By  maximum  likelihood.   Suppose  the  explanatory 

variable  takes  on  value   x.   on  the  i —  occasion,  and  a  success 

i 

(event,  e.g.  rain)  occurs,   i  =  l,2,...,n  happenings  occur. 

Then  the  likelihood  of  the   a,b   using  the  data   (6. ,  x . , 

no  n  i.  r  fl      if  event   occurs         .      „^„    .n  Kq 

i=1,2 ,.. .n) ,      where      6 .    =     4  /is   seen   to  be 


;.   =    P     1 
1  lo      o 


otherwise 


6./  vl-6. 

n    /      a+bx-:        \     1/  t  \        i 


L<a'b'^    *    ^    1+.«+bxi  ,  +    a+bXi 


i=l\l  +  e  -1  /        \1  +  e 


or 


n  n  a+bx . 

Ma,b;x)    =    Hn  L   =      I    6.  (a+bx.)    -      I    Jln[l+e  1] 

i=l   x  x  i=l 
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Differentiation,  then  solution  for  a,  b   yields  maximum  likelihood 
parameter  estimates.   See  Cox  (19  70) ,  and,  recently,  McCullagh  (19  80) 
Pregibon  (19  90)  has  generated  robust  procedures. 

(ii)   By  grouping.   If  the  data  set  is  large  one  can 

order  the   x*s,   split  the  ordered  values  into  an  equal  number 

of  groups,   g,   calculate  the  frequency  of  successes  in  each 

n  . 
group  p .  =  -7 — -A-y     where  n .   is  the  number  of  successes  in  group 

j   and  n   is  the  total  number  of  observations.   Now  let   x.   be 
J  3 

a  representative  explanatory  variable  value  for  group   j ;   the 
median  may  be  appropriate.   Note  that 


y.  =  £n(p\/(l-p\)j 


vs   a  +  b  x  . 
—        3 


should  be  nearly  linear  if  the  logistic  relationship  holds  (if 
not,  try  a  transformation) .   In  any  case  it  is  linear  in  the 
parameters,  so  a  fit  can  be  readily  made  and  diagnostically 
viewed.   One  can  even  fit  the  above  by  weighted  least  squares, 
or  perhaps  biweights.   Such  procedures  are  under  investigation 
at  the  Naval  Postgraduate  School  by  the  author  and  Dennis  Mar; 
they  seem  promising.   See  Cox  (1970)  ,  especially  Chap.  3,  for 
some  good  discussion.   The  procedure  can  be  made  multivariate 
(x  a  vector) ,  provided  there  is  a  good  deal  of  data. 

Notice  that  the  above  method  handles  only  dichotomous 
situations  ("rain,  no  rain,"  "visibility  in  category  (*),  visi- 
bility in  catagory  (not  (*))").   A  multiclass  version  of  the 
above  has  been  devised  by  P.  Bloomfield,  J.  Lehoczky  and  the 
author  (possibly  also  by  others)  and  is  under  development. 
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Another  approach  to  the  irmlticlass  problem  is  by  conditional 
probabilities.   Suppose   E.   is  the  event  that  an  observation 
(e.g.  of  visibility  state)  lies  in  class   j   ( j  = 1 ,2 , . . . ,C) ,   and 
x.   is  the  corresponding  explanatory  variable  value.   One  can 
construct  an  estimate  of  the  density  function  given  that   E.  has 

: 

occurred,   f(x. |E.),   for  each  class,   j  =  1,2,..., C;   the  histo- 
gram, raw  or  smoothed  (c.f.  Wegman  (1979)),  or  a  simple  model 
(normal,  lognormal,  etc. )   may  do  well.   By  Bayes1  theorem,  then 

f  (x|E  .)P{E  .} 
P{Ej|x}  =  _ 1 2 . 

Y   f(x|E.)P{E.} 
j=l     '^     J 


here   P{E.}   is  the  overall  marginal  probability  of  an  event  in 
class   j,   estimable  from  history.   Such  an  approach  is  under 
empirical  investigation  by  the  author,  utilizing  some  meteoro- 
logical data. 
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5.   Time  Series 

Many  of  the  data  sets  encountered  in  studies  of  the  natural 
environment  have  distinctive  time-series  structure.   This  means 
that  successive  observations  in  time  (and  contiguous  observations 
in  space)  are  likely  to  be  rather  similar  or  correlated  as  a 
result  of  both  (a)  important  natural  phenomena  associate  system- 
atically with  seasons,  years,  geographical  places,  and  forces  of 
nature,  and  also  (b)  what  may  be  considered  to  be  the  haphazard 
superposition  of  a  variety  of  additional  effects,  among  which  may 
be  measurement  error.   These  latter  effects  typically  do  not 
appear  independent  from  observation  to  observation,  a  fact  that 
represents  both  opportunity  and  difficulty  in  statistical  analysis 
In  the  present  section  we  review  a  few  basic  notions  and  concepts 
in  time  series  analysis.   The  topic  is  subtle,  and  deserves  more 
than  it  gets  in  this  discussion. 

5.1.   Components  of  a  Time  Series 

One  way  of  thinking  about  a  time  series,  say  of  monthly 
total  precipitation  at  a  point  in  space,  or  visibility  at  a  point, 
is  in  terms  of  the  simple  components 

o     (apparent)  trend 

o     (apparent)  seasonal  effect 

o     (random)  disturbance  or  noise. 
The  trend  ideally  represents  a  systematic  long-term  change; 
"apparent"  is  appended  because  a  steady  trend  may  actually  be 
quite  impermanent,  giving  way  to  a  new  and  different  trend.   Think 
of  the  economy,  particularly  the  stock  market  indices.   Mean  sea 
levels  measured  by  tide  gauges  are  also  of  this  nature,  possibly 


28 


systematic  effect 


because  a  regime  of  slow,  regular,  changes  in  land  supporting  a 
tide  gauge  may  rather  suddenly  be  supplanted  by  a  different 
regime,  perhaps  the  result  of  human  activity.   Sometimes  a  trend 
may  be  taken  to  be  linear,  at  least  provisionally.   The  seasonal 
effect  often  seems  inevitable:   ordinarily  it  rains  in  the  winter 
in  Monterey,  and  not  in  the  summer.   Tides  behave  in  accordance 
with  time  of  day,  with  a  slow  trend  superimposed.   Left  over  is 
the  contribution  of  random  disturbance  or  noise,  which  represents 
the  joint  influence  of  other  conditions,  including  measurement 
error.   Interestingly,  the  latter  random  component  has  received 
more  sophisticated  mathematical-statistical  attention  than  have 
the  other  components.   Understanding  trends  and  seasonals 
(apparently  systematic  effects)  probably  depends  upon  understand- 
ing the  underlying  subject-matter  area. 
5.2.   Decomposition 

Investigation  of  a  time  series   z .  ,   t  =  1,2,... 
ought  to  begin  by  a  study  of  graphical  display.   The  inter-rela- 
tions between  several  series  is  often  suggested  by  such  an  approach 
Suppose  one  wishes  to  isolate  the  semi -permanent  component  of  the 
series  (trend  and  seasonal) .   This  can  be  approached  in  the 
following  ways: 

o   Fit  a  specific  function  to  the  trend, 

o    Smooth ,  either  non-robustly  or  robustly. 
If  a  plot  shows  a  nearly  linear  trend  (perhaps  after  transforma- 
tion,  e.g.  by  logs)  one  can  fit 

z.   vs   a + bt  t  =  1,2,.  .  .  ,T 
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by  least  squares.   Now  if  the  true  situation  were  well-represented 
by  the  model 

Z   =  a  +  bt  +  e 

with   {£4./  t=l,2,...}   a  sequence  of  non-independent  random 
variables  with,  nevertheless  E[et]  =  0,   var[et]  =  a2,   the 
Ordinary  Least  Squares  is  less  than  perfectly  efficient.   However 
it  is  ordinarily  a  useful  approach — maybe  all  that  is  available. 
Once   aTc'   bLS   are  available  one  can  subtract  away  the  fit, 
leaving  the  residuals 

rt  E  Zt  "  *LS  "  ^LS11  (~  £t}  * 

for  study.   If  the  trend  removal  has  been  effective, and  no  inter- 
vening seasonal  has  occurred,  the   r,  ' s  will  have  roughly  constant 
variation  for  all   t   (box  plot  segments  of  the  series  for  diagnosis) , 
and  a  characterization  or  study  of  the  r*.'s      i-s  i-n  order.   Note 
that  if  the  covariance  function 


f  (T)  =  t~T  l      rt  rt 

t=l   r   r 


T-T 

+  T 


remains  and  diminishes  only  slowly  as   x   increases  this  signifies 
substantial  similarities  between  observations  separated  in  time 
contributed  by  something  (here  it  is  assumed  to  be  the  noise  com- 
ponent, although  it  might  well  be  an  un-removed  contribution  by 
seasonal,  or  a  change  in  trend) .   Often   f (t)   looks  something 
like  this 
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5.3.   Noise  Models 

It  is  convenient  to  employ  simple  parametric  models 
to  describe  noise  behavior.   Here  are  several 

o   Simple  Markov  or  Autoregressive :   e   =  pe   ,  +  a  , 

{a  ,  t  =  l,2,...}   is  "white  noise,"  a  sequence  of 

independently  Gaussian  variables. 

t  +  at-l 


a.  +  a^  ■,  +  ...    a 


o   Simple  Moving  Average:   e   = 


tzE 


p+1 

p  an  integer  >  0. 
o   Autoregressive-Moving  Average  (ARMA  (1,1)): 

£t-p£t-l  =  at-8at-l 

It  is  well  to  start  with  the  simple  Markov,  for  which  one  may 
estimate   p   from  the  residuals  by 


p  =  r  J:  rtrt-i 


The  theoretical  autocovariance  function  of  the  Markov  (AR)  is 


actually 


f  =  c2  p11' 


t   =  0,1,2, 


so  as  a  check   (p)T   should  die  off  exponentially.   A  plot  on 
semi-log  paper  is  helpful  for  a  check. 
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Although  least  squares  may  provide  reasonable  estimates 
for  parameters   a   and  b,   the  attempt  to  use  the  sum  of  the 
squares  of  the  residuals  to  estimate   a2   (noise  variance) ,  and 
then  to  use  this  estimate  in  conventional  formulas  for  standard 
errors  and  confidence  limits  for   a   and   b   is  often  a  bad 
mistake:   the  "standard"  results,  such  as  those  found  in  regres- 
sion packages  at  computer  centers,  are  likely  to  give  wildly 
optimistic  (falsely  precise)  results.   Such  effects  have  been 
noticed  by  Abreu  (1980)  in  an  investigation  of  sea  level  trends 
on  the  U.S.  Pacific  Coast. 

Note  that  if_  a  linear  trend  has  been  fitted  by  OLS , 
and  the  residuals  appear  to  be  AR(1) ,   then  the  variance  of  the 
slope  term  estimate,   ^qls'   ^s  aPProximately  that  given  by  the 
OLS  formulas  appropriate  for  purely  random  noise,  multiplied  by 
(1±P 


For  more  detail  see  Bloomfield  (19  80) . 
1-  P  ' 
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APPENDIX 

The  purpose  of  this  appendix  is  to  report  the  results  of 
simulation  tests  of  the  performance  of  Winsorized   t   confidence 
limits  as  compared  to  ordinary  Student's   t.   In  particular, 
comparative  performance  is  reported,  as  measured  by  (i)  confidence 
interval  coverage  (fraction  of  intervals  actually  containing  the 
true  value  of  the  mean) ,  and  (ii)  confidence  interval  average 
(estimate  of  expected)  width. 

o   Simulated  Data.   The  "data"  were  obtained  as  follows. 
Alternatively , 

(a)  Samples  of  size   n  =  20   from  Normal  (0,1) .   The 
random  variable  of  which  the  observations  are 
independent  instances  will  be  called   Z. 

(b)  Samples  of  size   n  =  20   from  the  distribution  of 

hZ2 
Y  =  Z  e    ,   h  >  0.   The   y-values   are  more  spread 

out  (but  still  symmetrically  so)  than  are  the  basic 

z's,   in  order  to  represent  long,  fat,  tails  possibly 

resulting  from  outliers.   The  above  convenient  form 

has  been  suggested  by  Tukey. 

(c)  Samples  of  size  n  =  20   from  the  distribution  of 

a  *  Z 
W  =  (ey   -l)/g',   the  latter  being  recognizable  as 

the  asymmetric  log-Normal  form. 

o   Experimental  Sampling.   Suppose  that  a  sample  is  available 
(from  one  of  the  above  situations,  but  the  exact  source  being 
unknown) ,  compute  confidence  limits  for  the  mean  of  the  parent 
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distribution,  using  (A)  ordinary  Student's   t,   and  (B)  Winsorized 
t,   using  g  =  2,   i.e.  the  lowest  two  values  were  given  the  value 
of  x  t ->\  *      an<3  tne  highest  two  values  the  value  of   x,,q..   Do  so 
repeatedly  (k  =  1000  times  at  present)  and  compare  for  (i)  cover- 
age to  nominal  (here  95%  two-sided)  and  (ii)  average  width  of 
confidence  intervals.   These  parameter  values  were  examined: 

g'  =  0.2,    h  =  1,00,    and   h  =  2.00 
o   Results  of  the  Sampling  Experiment. 


Case 
Normal  (Z) 
Long-Tail  (Y) 
Skewed  (W) 
Long-Tail  (  Y) 
Skewed  (W) 
Long-Tail  (Y) 
Long-Tail  (W) 


2_L    h 
0    0 


0 


Method 


Student 


1.00   Student 


0.2   0 


Student 


0  1.00  Winsor(2) 

0.2  0  Winsor(2) 

0  2.00  Student 

0  2.00  Winsor(2) 


Coverage (%)   Av.  Width 
95.5        0.92 


99.3 
94.8 
96.4 
95.3 
99.8 
98.0 


8658 


0.95 


8.46 


0.98 


4.3x10 


548 


10 


o   Conclusions.   Obvious  indication  of  the  efficacy  of  Winsor- 
izing  long-tailed  (Y)  observations  as  compared  to  traditional 
student's   t.   At  that,  the  level  of  Winsorization  (g  =  2)  is 
probably  too  small,  and  a  larger  value  would  provide  valid 
narrower,  intervals.   Alternatively,  utilize  biweights. 
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